Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
Inspired by the impressive performance of recent face image editing methods, several studies have naturally extended these methods to the face video editing task. One of the main challenges here, still unresolved, is temporal consistency among edited frames. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can, for the first time among face video editing models, extract decomposed identity and motion features from a given video. This modeling allows us to edit the video simply by shifting the temporally invariant identity feature in the desired direction, which preserves consistency across frames. Another unique strength of our model is that, being based on diffusion models, it satisfies both reconstruction and editing capability at the same time, and, unlike existing GAN-based methods, is robust to corner cases in wild face videos (e.g. occluded faces).
Comment: CVPR 2023. Our project page: https://diff-video-ae.github.io
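To make the editing recipe concrete, here is a minimal sketch of how a disentangled video encoding like the one described above could be edited: per-frame identity features are collapsed into a single time-invariant code, shifted along an attribute direction, and shared across all frames. All names (`identity_feats`, `edit_direction`) are illustrative, not the paper's actual API.

```python
import torch

def edit_identity_feature(identity_feats: torch.Tensor,
                          edit_direction: torch.Tensor,
                          strength: float = 1.0) -> torch.Tensor:
    """Hedged sketch of the editing idea described in the abstract.

    identity_feats: (T, D) per-frame identity features from the video encoder
    edit_direction: (D,) attribute direction (e.g. from a linear attribute
                    classifier in feature space); purely illustrative here
    """
    # Collapse per-frame codes into one time-invariant identity feature.
    shared_id = identity_feats.mean(dim=0)
    # Shift it along the attribute direction to apply the edit.
    edited_id = shared_id + strength * edit_direction
    # Reuse the same edited code for every frame; because the identity
    # feature is shared while the per-frame motion features stay untouched,
    # the edit remains temporally consistent by construction.
    return edited_id.unsqueeze(0).expand_as(identity_feats)
```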
3D-aware Blending with Generative NeRFs
Image blending aims to combine multiple images seamlessly. It remains
challenging for existing 2D-based methods, especially when input images are
misaligned due to differences in 3D camera poses and object shapes. To tackle
these issues, we propose a 3D-aware blending method using generative Neural
Radiance Fields (NeRF), including two key components: 3D-aware alignment and
3D-aware blending. For 3D-aware alignment, we first estimate the camera pose of
the reference image with respect to generative NeRFs and then perform 3D local
alignment for each part. To further leverage 3D information of the generative
NeRF, we propose 3D-aware blending that directly blends images on the NeRF's
latent representation space, rather than raw pixel space. Collectively, our
method outperforms existing 2D baselines, as validated by extensive
quantitative and qualitative evaluations with FFHQ and AFHQ-Cat.
Comment: ICCV 2023, Project page: https://blandocs.github.io/blendnerf
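The latent-space blending step lends itself to a very small sketch. Assuming per-layer latent codes `w_original` and `w_reference` for two already-aligned images, blending on the generative NeRF's latent representation is a weighted interpolation rather than a pixel composite (the names and the rendering call are hypothetical, not the authors' code):

```python
import torch

def blend_latents(w_original: torch.Tensor,
                  w_reference: torch.Tensor,
                  alpha) -> torch.Tensor:
    """Sketch of 3D-aware blending in latent space.

    w_original, w_reference: (L, D) per-layer latent codes of the two
    images, assumed already 3D-aligned as in the abstract.
    alpha: scalar or (L, 1) blend weights; larger alpha keeps more of
    the reference at that layer of the representation.
    """
    w_blend = (1.0 - alpha) * w_original + alpha * w_reference
    # A generative NeRF G would then render the blended code from any
    # pose, e.g. image = G.render(w_blend, camera_pose)  # hypothetical API
    return w_blend
```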
Dual Attention GANs for Semantic Image Synthesis
In this paper, we focus on the semantic image synthesis task that aims at
transferring semantic label maps to photo-realistic images. Existing methods
lack effective semantic constraints to preserve the semantic information and
ignore the structural correlations in both spatial and channel dimensions,
leading to blurry, artifact-prone results. To address these
limitations, we propose a novel Dual Attention GAN (DAGAN) to synthesize
photo-realistic and semantically-consistent images with fine details from the
input layouts without imposing extra training overhead or modifying the network
architectures of existing methods. We also propose two novel modules, i.e.,
position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention
Module (CAM), to capture semantic structure attention in spatial and channel
dimensions, respectively. Specifically, SAM selectively correlates the pixels
at each position by a spatial attention map, leading to pixels with the same
semantic label being related to each other regardless of their spatial
distances. Meanwhile, CAM selectively emphasizes the scale-wise features at
each channel by a channel attention map, which integrates associated features
among all channel maps regardless of their scales. We finally sum the outputs
of SAM and CAM to further improve feature representation. Extensive experiments
on four challenging datasets show that DAGAN achieves remarkably better results
than state-of-the-art methods, while using fewer model parameters. The source
code and trained models are available at https://github.com/Ha0Tang/DAGAN.
Comment: Accepted to ACM MM 2020, camera ready (9 pages) + supplementary (10 pages)
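The SAM/CAM description above maps closely onto the standard dual-attention formulation, so a compact PyTorch sketch may help. This follows the generic position-wise/channel-wise attention pattern; the exact DAGAN modules may differ in details such as gating and normalization.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position-wise attention sketch: every pixel attends to every other
    pixel, so positions sharing a semantic label can reinforce each other
    regardless of spatial distance (as the abstract describes for SAM)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) map
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = v @ attn.transpose(1, 2)                 # aggregate positions
        return out.view(b, c, h, w) + x                # residual connection

class ChannelAttention(nn.Module):
    """Channel-wise attention sketch: a (C x C) map integrates associated
    features among all channel maps (as the abstract describes for CAM)."""
    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.flatten(2)                                        # (B, C, HW)
        attn = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)  # (B, C, C)
        return (attn @ flat).view(b, c, h, w) + x

# Per the abstract, the two outputs are summed: fused = sam(x) + cam(x)
```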
Generating Images Instead of Retrieving Them: Relevance Feedback on Generative Adversarial Networks
Finding images matching a user's intention has been largely based on matching a representation of the user's information needs with an existing collection of images: for example, using an example image or a written query to express the information need and retrieving images that share similarities with the query or example image. However, such an approach is limited to retrieving only images that already exist in the underlying collection. Here, we present a methodology for generating images matching the user intention instead of retrieving them. The methodology utilizes a relevance feedback loop between a user and generative adversarial neural networks (GANs). GANs can generate novel photorealistic images which are initially not present in the underlying collection, but generated in response to user feedback. We report experiments (N=29) where participants generate images using four different domains and various search goals with textual and image targets. The results show that the generated images match the tasks and outperform images selected as baselines from a fixed image collection. Our results demonstrate that generating new information can be more useful for users than retrieving it from a collection of existing information.
Peer reviewed
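A minimal sketch of one round of such a feedback loop, assuming the GAN samples from a Gaussian latent prior; the re-centering rule and all names are illustrative, not the paper's actual update:

```python
import numpy as np

def next_latents(latents: np.ndarray, relevant: np.ndarray,
                 shrink: float = 0.7, seed: int = 0) -> np.ndarray:
    """One relevance-feedback round over a GAN latent space.

    latents:  (N, D) codes whose generated images the user just rated
    relevant: (N,) booleans marking the images the user judged relevant
    """
    rng = np.random.default_rng(seed)
    if not relevant.any():
        # No positive feedback yet: keep exploring the latent prior.
        return rng.standard_normal(latents.shape)
    # Exploit: re-center sampling on the liked region and shrink the spread.
    center = latents[relevant].mean(axis=0)
    return center + shrink * rng.standard_normal(latents.shape)
```

Each new batch would be decoded by the GAN, shown to the user, and the loop repeated until a generated image matches the search intent.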
DeepFacePencil: Creating Face Images from Freehand Sketches
In this paper, we explore the task of generating photo-realistic face images
from hand-drawn sketches. Existing image-to-image translation methods require a
large-scale dataset of paired sketches and images for supervision. They
typically utilize synthesized edge maps of face images as training data.
However, these synthesized edge maps strictly align with the edges of the corresponding face images, which limits their generalization to real hand-drawn sketches with vast stroke diversity. To address this problem, we
propose DeepFacePencil, an effective tool that is able to generate
photo-realistic face images from hand-drawn sketches, based on a novel dual
generator image translation network during training. A novel spatial attention pooling (SAP) module is designed to adaptively handle spatially varying stroke distortions, supporting diverse stroke styles and levels of detail. We conduct extensive experiments, and the results demonstrate the
superiority of our model over existing methods on both image quality and model
generalization to hand-drawn sketches.
Comment: ACM MM 2020 (oral)
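The abstract does not spell out SAP's internals, but one plausible reading is attention-weighted pooling over a local window, which lets each position softly pick the best-matching offset under stroke distortion. The sketch below illustrates that reading only; it is not the paper's module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionPooling(nn.Module):
    """Illustrative attention pooling over a k x k neighborhood: each
    position predicts softmax weights over local offsets and pools its
    features accordingly, tolerating spatially varying stroke shifts."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        self.attn = nn.Conv2d(channels, k * k, 1)  # per-pixel offset weights

    def forward(self, x):
        b, c, h, w = x.shape
        w_attn = torch.softmax(self.attn(x), dim=1)          # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        w_attn = w_attn.view(b, 1, self.k * self.k, h * w)
        # Weighted sum over the k*k offsets at every spatial position.
        return (patches * w_attn).sum(dim=2).view(b, c, h, w)
```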
Retrieval Guided Unsupervised Multi-domain Image to Image Translation
Image-to-image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that image descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, synthesizing new images is extremely challenging, especially in multi-domain translations, as the network has to compose content and style to generate reliable and diverse images in multiple domains. In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. First, we train an image-to-image translation model to map images to multiple domains. Then, we train an image retrieval model using real and generated images to find images similar to a query one in content but in a different domain. Finally, we exploit the image retrieval system to fine-tune the image-to-image translation model and generate higher quality images. Our experiments show the effectiveness of the proposed solution and highlight the contribution of the retrieval network, which can benefit from additional unlabeled data and help image-to-image translation models in the presence of scarce data.
Raul Gomez, Yahui Liu, Marco De Nadai, Dimosthenis Karatzas, Bruno Lepri, Nicu Sebe
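The fine-tuning step can be illustrated with a hypothetical retrieval-guidance loss: embeddings of generated images are pulled toward their nearest real target-domain neighbors under the retrieval model. The exact objective in the paper may differ; this is a sketch of the mechanism only:

```python
import torch
import torch.nn.functional as F

def retrieval_guidance_loss(fake_feats: torch.Tensor,
                            real_feats_bank: torch.Tensor,
                            topk: int = 1) -> torch.Tensor:
    """Pull generated images toward their retrieved real neighbors.

    fake_feats:      (B, D) retrieval embeddings of generated images
    real_feats_bank: (M, D) embeddings of real target-domain images
    """
    fake = F.normalize(fake_feats, dim=1)
    bank = F.normalize(real_feats_bank, dim=1)
    sim = fake @ bank.t()               # (B, M) cosine similarities
    nn_sim, _ = sim.topk(topk, dim=1)   # most similar real images per fake
    return (1.0 - nn_sim).mean()        # minimize distance to retrieved reals
```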
StarGAN v2: Diverse Image Synthesis for Multiple Domains
A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address either of the issues, having limited diversity or multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, a dataset of high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset are available at https://github.com/clovaai/stargan-v2.
Rethinking the Truly Unsupervised Image-to-Image Translation
Every recent image-to-image translation model inherently requires either image-level (i.e. input-output pairs) or set-level (i.e. domain labels) supervision. However, even set-level supervision can be a severe bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., neither paired images nor domain labels. To this end, we propose a truly unsupervised image-to-image translation model (TUNIT) that simultaneously learns to separate image domains and translate input images into the estimated domains. Experimental results show that our model achieves comparable or even better performance than the set-level supervised model trained with full labels, generalizes well on various datasets, and is robust against the choice of hyperparameters (e.g. the preset number of pseudo domains). Furthermore, TUNIT can be easily extended to semi-supervised learning with only a few labeled examples.
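As an illustration of what "separating image domains" means operationally, the sketch below produces pseudo-domain labels by clustering image features. Note that TUNIT itself learns the clustering jointly with a guiding network rather than running k-means; this stand-in only shows the role pseudo-domains play:

```python
import numpy as np

def estimate_pseudo_domains(features: np.ndarray, k: int,
                            iters: int = 50) -> np.ndarray:
    """Cluster (N, D) image embeddings into k pseudo-domains with plain
    k-means; the labels would then supervise a multi-domain translator."""
    rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign every image to its nearest cluster center.
        dists = ((features[:, None] - centers[None]) ** 2).sum(-1)  # (N, k)
        labels = dists.argmin(1)
        # Move each center to the mean of its assigned images.
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(0)
    return labels
```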
Reliable fidelity and diversity metrics for generative models
Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest versions of the precision and recall metrics are not yet reliable: for example, they fail to detect the match between two identical distributions, they are not robust against outliers, and their evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics.
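The definitions are simple enough to restate in code. Below is a compact NumPy sketch of density and coverage built from k-NN balls around real samples, following the paper's definitions (the authors also release an official implementation; this brute-force version is for clarity, not scale):

```python
import numpy as np

def density_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5):
    """Density and coverage from k-NN balls around real samples.

    real: (N, D) embeddings of real images; fake: (M, D) of generated ones.
    """
    # Radius of each real point's k-NN ball (k-th neighbor, excluding self:
    # column 0 after sorting is the zero self-distance).
    d_real = np.linalg.norm(real[:, None] - real[None], axis=-1)   # (N, N)
    radii = np.sort(d_real, axis=1)[:, k]
    # Which fake samples fall inside which real k-NN balls.
    d_cross = np.linalg.norm(real[:, None] - fake[None], axis=-1)  # (N, M)
    inside = d_cross <= radii[:, None]
    density = inside.sum() / (k * fake.shape[0])  # avg ball membership (>1 ok)
    coverage = inside.any(axis=1).mean()          # reals with a fake nearby
    return density, coverage
```

Unlike improved precision, density is not fooled by fakes piling onto a single real mode, and coverage directly measures how much of the real manifold the generator reaches.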